Noise Robust Voice Activity Detection
نویسنده
چکیده
Voice activity detection (VAD) is a fundamental task in various speech-related applications, such as speech coding, speaker diarization and speech recognition. It is often defined as the problem of distinguishing speech from silence/noise. A typical VAD system consists of two core parts: a feature extraction and a speech/ non-speech decision mechanism. The first part extracts a set of parameters from the signal, which are used by the second part to make the final speech/non-speech decision, based on a set of decision rules. Most VAD features proposed in the literature exploit the discriminative characteristics of speech in different domains, which can be divided into five categories: energy-based features, spectral-domain features, cepstral-domain features, harmonicity-based features, and long-term features. Energy-based features are simple and can be easily implemented in hardware. Spectral-domain and cepstral-domain features are more noise robust at low SNRs, as they are beneficial from a wide class of filtering and speech analysis techniques in these domains. When SNR is around 0 dB, or when the background noise contains complex acoustical events, features relying on the harmonic structure of voiced speech, as well as ones that exploit the long-term variability of speech appear to be more robust. Next, the second part of VAD decides the speech or non-speech class for each signal segment. Existing decision making mechanisms can be divided into three categories: thresholding, statistical modelling and machine learning. The first one is the simplest, yet sufficient in many cases where the features employed possess a good discriminative power. The latter two can work well at high SNRs, but their performance decline quickly at lower SNRs. In order to derive a state-of-the-art VAD algorithm, a comparative study has been carried out in this thesis to evaluate different VAD techniques. Traditionally, VAD algorithms are evaluated as a holistic system, from which it is hard to analyse whether i performance gain is achieved from a new feature or a new decision mechanism. In this report, the author examines the use of Pe, the probability of error of two given distributions, to measure performance of a VAD feature separately from other modules in the system. The metric represents the discriminative power of a feature when used for classifying speech and non-speech. The result is a fairer comparison and a more compact performance representation. This allows a deeper analysis of VAD features, which reveals interesting trends across different SNRs. Secondly, a new approach to VAD is proposed in this report, which tackles the cases where SNR can be lower than 0 dB and background might contain complex audible events. The proposed idea exploits the sub-regions of the speech noisy spectrum that still retain a sufficient harmonicity structure of the human voiced speech. This allows for a more robust feature, based on the local harmonicity of the spectral autocorrelation of the voiced speech, can be derived to reliably detect the heavily corrupted voiced speech segments. Experimental results showed a significant improvement over a recently proposed method in the same category.
منابع مشابه
A New Algorithm for Voice Activity Detection Based on Wavelet Packets (RESEARCH NOTE)
Speech constitutes much of the communicated information; most other perceived audio signals do not carry nearly as much information. Indeed, much of the non-speech signals maybe classified as ‘noise’ in human communication. The process of separating conversational speech and noise is termed voice activity detection (VAD). This paper describes a new approach to VAD which is based on the Wavelet ...
متن کاملA Robust Voice Activity Detection Based on Noise Eigenspace Projection
A robust voice activity detector (VAD) is expected to increase the accuracy of ASR in noisy environments. This study focuses on how to extract robust information for designing a robust VAD. To do so, we construct a noise eigenspace by the principal component analysis of the noise covariance matrix. Projecting noise speech onto the eigenspace, it is found that available information with higher S...
متن کاملStudy of noise robust voice activity detection based on periodic component to aperiodic component ratio
This paper describes a study of noise robust voice activity detection (VAD) utilizing the periodic component to aperiodic component ratio (PAR). Although environmental sound changes dynamically in the real world, conventional noise robust features for VAD are sensitive to the non-stationarity of noise, which yields variations in the signal to noise ratio, and sometimes requires apriori noise po...
متن کاملA robust voice activity detector for wireless communications using soft computing
Discontinuous transmission based on speech/pause detection represents a valid solution to improve the spectral efficiency of new-generation wireless communication systems. In this context, robust Voice Activity Detection (VAD) algorithms are required, as traditional solutions present a high misclassification rate in the presence of the background noise typical of mobile environments. This paper...
متن کاملRobust voice activity detection based on the entropy of noise-suppressed spectrum
A novel noise robust voice activity detection approach is introduced. The novelty of the method that it uses noise suppressed spectrum of the input signal for spectral entropy calculation. As a result excellent end-pointing performance is observed based on predefined global entropy threshold and time constraints. The effect of frame dropping controlled by the proposed algorithm was investigated...
متن کامل端點偵測技術在強健語音參數擷取之研究 (Study of the Voice Activity Detection Techniques for Robust Speech Feature Extraction) [In Chinese]
The performance of a speech recognition system is often degraded due to the mismatch between the environments of development and application. One of the major sources that give rises to this mismatch is additive noise. The approaches for handling the problem of additive noise can be divided into three classes: speech enhancement, robust speech feature extraction, and compensation of speech mode...
متن کامل